14  Week 14: Categorical Data Analysis

Introduction to Statistics for Animal Science

Author: AnS 500 - Fall 2025

Published: November 15, 2025

15 Introduction

Up to this point in the course, we’ve focused primarily on continuous outcomes: cattle weight gains, milk production, feed efficiency, and similar numeric measurements. But many important questions in animal science involve categorical outcomes:

  • Does a vaccine prevent disease? (Yes/No)
  • What coat color will offspring inherit? (Black/Red/White)
  • Do different housing systems affect lameness rates? (Lame/Not Lame)
  • Is there a relationship between breed and temperament? (Calm/Nervous/Aggressive)

When our outcome variable is categorical rather than continuous, we need different statistical tools. This week, we’ll explore methods for analyzing categorical data, from simple chi-square tests to logistic regression models.

15.1 A Motivating Example

Imagine you’re investigating bovine respiratory disease (BRD) in feedlot cattle. You’ve implemented a new vaccine protocol and want to know:

  1. Did it work? Are disease rates different between vaccinated and unvaccinated cattle?
  2. How much did it help? What’s the reduction in disease risk?
  3. What factors matter? Do age, weight, or previous health status affect disease risk?

Traditional t-tests and ANOVA won’t work here—our outcome is binary (disease: Yes/No), not continuous. We need categorical data analysis methods.

15.2 Key Questions We’ll Address

By the end of this week, you’ll be able to answer:

  • When should I use chi-square tests vs Fisher’s exact test?
  • How do I calculate and interpret odds ratios and risk ratios?
  • What’s the difference between association and causation with categorical data?
  • When do I need logistic regression instead of simpler tests?
  • How do I interpret logistic regression coefficients?
Note: Building on Previous Weeks

We’ve already covered hypothesis testing (Week 4) and ANOVA (Week 5). The logic is the same for categorical data:

  1. State hypotheses (null: no association; alternative: association exists)
  2. Calculate a test statistic
  3. Determine if results are more extreme than expected by chance
  4. Consider effect sizes and practical significance

The difference is in how we measure associations with categorical variables.

16 Understanding Categorical Variables

16.1 Types of Categorical Variables

Categorical variables come in different forms:

Nominal: Categories with no inherent order
  • Breed (Holstein, Jersey, Angus)
  • Coat color (Black, White, Brown)
  • Disease diagnosis (BRD, Pneumonia, Healthy)

Ordinal: Categories with a natural order
  • Body condition score (1, 2, 3, 4, 5)
  • Lameness score (None, Mild, Moderate, Severe)
  • Treatment response (Poor, Fair, Good, Excellent)

Binary: Special case with only two categories
  • Disease status (Yes/No)
  • Survival (Alive/Dead)
  • Treatment group (Control/Treated)

Important: The methods we cover this week work best for nominal and binary data

For ordinal data with many categories, specialized methods (ordinal logistic regression, Wilcoxon tests) may be more appropriate. However, chi-square tests can still be used as a first approximation.
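As a sketch of one of those specialized methods: ordinal logistic regression (proportional odds) can be fit with `polr()` from the MASS package. The data below are simulated purely for illustration; with real data you would supply your own ordered factor.

Code
# Sketch: proportional-odds model for an ordered outcome (requires MASS)
# Note: loading MASS masks dplyr::select(), so load it before dplyr
# or call MASS::polr() directly.
library(MASS)

d <- data.frame(
  score = factor(sample(c("None", "Mild", "Severe"), 60, replace = TRUE),
                 levels = c("None", "Mild", "Severe"), ordered = TRUE),
  trt = rep(c("A", "B"), each = 30)
)

fit <- polr(score ~ trt, data = d, method = "logistic")
exp(coef(fit))   # proportional odds ratio for treatment B vs A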

16.2 Contingency Tables

A contingency table (also called a cross-tabulation) displays the frequency distribution of two or more categorical variables.

Let’s create a simple example:

Code
# Simulate vaccine trial data
vaccine_data <- tibble(
  animal_id = 1:200,
  vaccine = rep(c("Control", "Vaccinated"), each = 100),
  disease = c(
    sample(c("Yes", "No"), 100, replace = TRUE, prob = c(0.30, 0.70)),  # Control
    sample(c("Yes", "No"), 100, replace = TRUE, prob = c(0.15, 0.85))   # Vaccinated
  )
)

# Create contingency table
table_result <- table(vaccine_data$vaccine, vaccine_data$disease)
table_result
            
             No Yes
  Control    67  33
  Vaccinated 88  12

This is a 2×2 contingency table (2 rows × 2 columns). Let’s make it more informative:

Code
# Add row/column totals
addmargins(table_result)
            
              No Yes Sum
  Control     67  33 100
  Vaccinated  88  12 100
  Sum        155  45 200

We can also calculate proportions:

Code
# Proportions by row (vaccine status)
prop.table(table_result, margin = 1) %>%
  round(3)
            
               No  Yes
  Control    0.67 0.33
  Vaccinated 0.88 0.12

Interpretation: In this simulated data:
  • 33% of control animals got disease
  • 12% of vaccinated animals got disease
  • Difference: 33% - 12% = 21 percentage points

But is this difference statistically significant, or could it have occurred by chance?

17 Chi-Square Goodness of Fit Test

The chi-square goodness of fit test answers the question: Do observed frequencies match expected frequencies?

17.1 When to Use It

Use this test when you have:
  • One categorical variable
  • A hypothesized expected distribution
  • A question about whether your data fit that distribution

17.2 Classic Example: Mendelian Genetics

Suppose you’re breeding horses and expect a 3:1 ratio of black to chestnut coat color (simple dominant inheritance). You observe:

  • Black: 73
  • Chestnut: 27
  • Total: 100

Does this match the expected 3:1 ratio?

Code
# Observed counts
observed <- c(Black = 73, Chestnut = 27)

# Expected proportions (3:1 ratio)
expected_props <- c(0.75, 0.25)

# Chi-square goodness of fit test
chisq_result <- chisq.test(observed, p = expected_props)
chisq_result

    Chi-squared test for given probabilities

data:  observed
X-squared = 0.21333, df = 1, p-value = 0.6442
Code
# Tidy output
tidy(chisq_result)
# A tibble: 1 × 4
  statistic p.value parameter method                                  
      <dbl>   <dbl>     <dbl> <chr>                                   
1     0.213   0.644         1 Chi-squared test for given probabilities

Interpretation:
  • p-value = 0.644
  • We fail to reject the null hypothesis
  • The observed data are consistent with a 3:1 ratio

17.3 How It Works

The chi-square test compares observed (O) and expected (E) frequencies:

\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]

Let’s calculate it manually:

Code
# Expected counts (3:1 ratio with n=100)
expected <- c(75, 25)

# Calculate chi-square statistic
chi_square_stat <- sum((observed - expected)^2 / expected)
chi_square_stat
[1] 0.2133333
Code
# Compare to test result
chisq_result$statistic
X-squared 
0.2133333 

The p-value comes from the chi-square distribution with degrees of freedom = k - 1 (where k = number of categories).
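We can compute that tail area directly with `pchisq()`, which reproduces the p-value reported by `chisq.test()` above:

Code
# Upper-tail area of the chi-square distribution at our statistic
pchisq(chi_square_stat, df = 1, lower.tail = FALSE)
# 0.6442, matching the test output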

Code
# Visualize chi-square distribution
tibble(x = seq(0, 10, 0.01)) %>%
  mutate(density = dchisq(x, df = 1)) %>%
  ggplot(aes(x = x, y = density)) +
  geom_line(size = 1, color = "steelblue") +
  geom_vline(xintercept = chi_square_stat, color = "red", linetype = "dashed", size = 1) +
  geom_area(data = . %>% filter(x >= chi_square_stat),
            fill = "red", alpha = 0.3) +
  labs(title = "Chi-Square Distribution (df = 1)",
       subtitle = "Red area = p-value",
       x = "Chi-square statistic",
       y = "Density") +
  theme_minimal()

17.4 Assumptions

Warning: Chi-Square Test Assumptions
  1. Independent observations: Each animal counted once
  2. Expected cell counts ≥ 5: Rule of thumb for valid p-values
  3. Random sampling: Data should be representative

If expected counts < 5, the chi-square approximation may be poor. Consider:
  • Combining categories
  • Using exact tests
  • Collecting more data

17.5 More Complex Example: Dihybrid Cross

In a dihybrid cross (two genes), we expect a 9:3:3:1 ratio. Let’s test data from 160 offspring:

Code
# Observed counts
observed_dihybrid <- c(95, 28, 26, 11)
names(observed_dihybrid) <- c("Both_Dominant", "First_Dominant",
                               "Second_Dominant", "Both_Recessive")

# Expected proportions (9:3:3:1 ratio)
expected_props_dihybrid <- c(9, 3, 3, 1) / 16

# Chi-square test
chisq_dihybrid <- chisq.test(observed_dihybrid, p = expected_props_dihybrid)
tidy(chisq_dihybrid)
# A tibble: 1 × 4
  statistic p.value parameter method                                  
      <dbl>   <dbl>     <dbl> <chr>                                   
1      1.04   0.790         3 Chi-squared test for given probabilities
Code
# Visualize observed vs expected
tibble(
  Phenotype = factor(names(observed_dihybrid),
                     levels = names(observed_dihybrid)),
  Observed = observed_dihybrid,
  Expected = expected_props_dihybrid * sum(observed_dihybrid)
) %>%
  pivot_longer(cols = c(Observed, Expected),
               names_to = "Type", values_to = "Count") %>%
  ggplot(aes(x = Phenotype, y = Count, fill = Type)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("Observed" = "steelblue",
                                "Expected" = "coral")) +
  labs(title = "Dihybrid Cross: Observed vs Expected Counts",
       subtitle = paste0("χ² = ", round(chisq_dihybrid$statistic, 2),
                        ", p = ", round(chisq_dihybrid$p.value, 3)),
       x = "Phenotype", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Interpretation: The data fits the expected 9:3:3:1 ratio well (p = 0.79).

18 Chi-Square Test of Independence

The chi-square test of independence asks: Are two categorical variables associated?

18.1 The Null Hypothesis

  • H₀: The two variables are independent (no association)
  • H₁: The two variables are associated

18.2 Example: Vaccine Efficacy

Let’s return to our vaccine trial. Are vaccine status and disease outcome independent?

Code
# Recreate table for clarity
vaccine_table <- table(vaccine_data$vaccine, vaccine_data$disease)
vaccine_table
            
             No Yes
  Control    67  33
  Vaccinated 88  12
Code
# Chi-square test of independence
chi_vaccine <- chisq.test(vaccine_table)
chi_vaccine

    Pearson's Chi-squared test with Yates' continuity correction

data:  vaccine_table
X-squared = 11.47, df = 1, p-value = 0.0007075
Code
tidy(chi_vaccine)
# A tibble: 1 × 4
  statistic  p.value parameter method                                           
      <dbl>    <dbl>     <int> <chr>                                            
1      11.5 0.000707         1 Pearson's Chi-squared test with Yates' continuit…

Interpretation:
  • χ² = 11.47
  • p-value ≈ 0.0007
  • We reject the null hypothesis
  • There is an association between vaccination and disease status

18.3 What Does the Test Do?

Under independence, we can calculate expected counts for each cell:

\[ E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}} \]
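For example, for the Control row and "No" column of our vaccine table (row total 100, column total 155, grand total 200):

\[ E_{\text{Control, No}} = \frac{100 \times 155}{200} = 77.5 \]

which matches the expected-count table shown below.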

Code
# Expected counts under independence
chi_vaccine$expected
            
               No  Yes
  Control    77.5 22.5
  Vaccinated 77.5 22.5

The chi-square statistic measures how different observed counts are from expected:

\[ \chi^2 = \sum_{all\ cells} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]

18.4 Degrees of Freedom

For a contingency table: \[ df = (r - 1) \times (c - 1) \]

where r = number of rows, c = number of columns.

For our 2×2 table: df = (2-1) × (2-1) = 1

18.5 Visualizing Associations

Code
# Create visualization
vaccine_data %>%
  count(vaccine, disease) %>%
  group_by(vaccine) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(aes(x = vaccine, y = prop, fill = disease)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = paste0(round(prop * 100, 1), "%")),
            position = position_dodge(width = 0.9),
            vjust = -0.5, size = 3.5) +
  scale_fill_manual(values = c("Yes" = "coral", "No" = "steelblue")) +
  scale_y_continuous(labels = percent_format(), limits = c(0, 1)) +
  labs(title = "Disease Rates by Vaccine Status",
       subtitle = paste0("χ² = ", round(chi_vaccine$statistic, 2),
                        ", p = ", round(chi_vaccine$p.value, 4)),
       x = "Vaccine Status", y = "Proportion", fill = "Disease") +
  theme_minimal()

18.6 Effect Size: Cramér’s V

Chi-square tells us if an association exists, but not how strong it is. For effect size, we use Cramér’s V:

\[ V = \sqrt{\frac{\chi^2}{n \times (k - 1)}} \]

where n = sample size, k = min(rows, columns)

Code
# Calculate Cramér's V
n <- sum(vaccine_table)
k <- min(nrow(vaccine_table), ncol(vaccine_table))
cramers_v <- sqrt(chi_vaccine$statistic / (n * (k - 1)))
cramers_v
X-squared 
0.2394737 
Code
# Interpretation guide
cat("Cramér's V =", round(cramers_v, 3), "\n")
Cramér's V = 0.239 
Code
cat("Effect size interpretation (Cohen's guidelines for 2x2 tables):\n")
Effect size interpretation (Cohen's guidelines for 2x2 tables):
Code
cat("  Small:  V = 0.10\n")
  Small:  V = 0.10
Code
cat("  Medium: V = 0.30\n")
  Medium: V = 0.30
Code
cat("  Large:  V = 0.50\n")
  Large:  V = 0.50

Our effect size (V = 0.239) suggests a small to medium association.

18.7 Larger Tables

Chi-square tests work with larger contingency tables too. Example: Three housing systems × three health outcomes:

Code
# Simulate housing system study
set.seed(123)
housing_data <- tibble(
  housing = sample(c("Confinement", "Pasture", "Mixed"),
                   300, replace = TRUE)
) %>%
  rowwise() %>%
  mutate(
    health = case_when(
      housing == "Confinement" ~ sample(c("Healthy", "Mild", "Severe"), 1,
                                        prob = c(0.60, 0.30, 0.10)),
      housing == "Pasture" ~ sample(c("Healthy", "Mild", "Severe"), 1,
                                    prob = c(0.75, 0.20, 0.05)),
      housing == "Mixed" ~ sample(c("Healthy", "Mild", "Severe"), 1,
                                  prob = c(0.70, 0.25, 0.05))
    )
  ) %>%
  ungroup()

# Contingency table
housing_table <- table(housing_data$housing, housing_data$health)
housing_table
             
              Healthy Mild Severe
  Confinement      66   35     11
  Mixed            68   20      2
  Pasture          75   22      1
Code
# Chi-square test
chi_housing <- chisq.test(housing_table)
tidy(chi_housing)
# A tibble: 1 × 4
  statistic p.value parameter method                    
      <dbl>   <dbl>     <int> <chr>                     
1      15.5 0.00384         4 Pearson's Chi-squared test
Code
# Mosaic-style visualization
housing_data %>%
  count(housing, health) %>%
  ggplot(aes(x = housing, y = n, fill = health)) +
  geom_col(position = "fill") +
  scale_fill_brewer(palette = "Set2") +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "Health Status by Housing System",
       subtitle = paste0("χ² = ", round(chi_housing$statistic, 2),
                        ", p = ", round(chi_housing$p.value, 4)),
       x = "Housing System", y = "Proportion", fill = "Health Status") +
  theme_minimal()

19 Fisher’s Exact Test

When sample sizes are small or expected cell counts are < 5, the chi-square approximation may be poor. In these cases, use Fisher’s exact test.

19.1 When to Use Fisher’s Exact Test

  1. Any expected cell count < 5
  2. Small sample sizes (n < 20 or so)
  3. When you want an exact p-value (not an approximation)
Note: Fisher's Exact Test vs Chi-Square

Fisher’s exact test calculates the exact probability of observing the data (and more extreme tables) under the null hypothesis of independence. Chi-square uses an approximation based on the chi-square distribution.

For large samples, results are nearly identical. For small samples, Fisher’s test is more reliable.

19.2 Example: Rare Disease Outbreak

Suppose we’re investigating a rare disease in a small herd. Only 15 animals total:

Code
# Small sample data
rare_disease <- matrix(c(7, 1,    # Exposed
                         4, 3),   # Not exposed
                       nrow = 2, byrow = TRUE,
                       dimnames = list(Exposure = c("Exposed", "Not Exposed"),
                                      Disease = c("Yes", "No")))
rare_disease
             Disease
Exposure      Yes No
  Exposed       7  1
  Not Exposed   4  3

Check expected counts:

Code
chisq.test(rare_disease)$expected
             Disease
Exposure           Yes       No
  Exposed     5.866667 2.133333
  Not Exposed 5.133333 1.866667

Two of the four expected counts fall below 5 (2.13 and 1.87 in the "No" column), so the chi-square approximation is unreliable here. Let's use Fisher's exact test:

Code
fisher_result <- fisher.test(rare_disease)
fisher_result

    Fisher's Exact Test for Count Data

data:  rare_disease
p-value = 0.2821
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
   0.2680235 313.3252444
sample estimates:
odds ratio 
  4.676004 
Code
tidy(fisher_result)
# A tibble: 1 × 6
  estimate p.value conf.low conf.high method                         alternative
     <dbl>   <dbl>    <dbl>     <dbl> <chr>                          <chr>      
1     4.68   0.282    0.268      313. Fisher's Exact Test for Count… two.sided  

Interpretation:
  • Exact p-value = 0.2821
  • We fail to reject the null hypothesis at α = 0.05
  • Despite the elevated sample odds ratio (4.68), this small sample does not provide sufficient evidence of an association between exposure and disease

Compare to chi-square (for demonstration):

Code
chi_rare <- chisq.test(rare_disease)

tibble(
  Test = c("Chi-Square", "Fisher's Exact"),
  `P-value` = c(chi_rare$p.value, fisher_result$p.value)
) %>%
  kable(digits = 4,
        caption = "Comparison of Tests on Small Sample Data")
Comparison of Tests on Small Sample Data
Test P-value
Chi-Square 0.4586
Fisher’s Exact 0.2821

R also warns that the chi-square approximation may be incorrect for this table, confirming that Fisher's exact test is the better choice here.

19.3 Computational Note

Fisher’s exact test can be computationally intensive for large tables. Most software (including R) can handle 2×2 tables easily, but larger tables may take time or use simulation-based approximations.
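For larger tables, `fisher.test()` can fall back on a Monte Carlo p-value instead of full enumeration. A sketch using the housing table from earlier (`simulate.p.value` and `B` are standard arguments to `fisher.test()`):

Code
# Monte Carlo approximation: p-value estimated from B random tables
fisher.test(housing_table, simulate.p.value = TRUE, B = 10000)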

20 Risk Ratios, Odds Ratios, and 2×2 Tables

Chi-square and Fisher’s tests tell us whether an association exists. To quantify how strong the association is, we use risk ratios and odds ratios.

20.1 2×2 Table Notation

For a standard 2×2 table:

              Disease Yes   Disease No   Total
Exposed       a             b            a+b
Not Exposed   c             d            c+d
Total         a+c           b+d          n

20.2 Risk Ratio (Relative Risk)

The risk ratio (RR) compares the risk of disease in exposed vs unexposed groups:

\[ RR = \frac{a/(a+b)}{c/(c+d)} = \frac{\text{Risk in exposed}}{\text{Risk in unexposed}} \]

Interpretation:
  • RR = 1: No association
  • RR > 1: Exposure increases risk
  • RR < 1: Exposure decreases risk (protective)

20.2.1 Example: Vaccine Efficacy

Code
# Our vaccine data in 2x2 format
vaccine_table
            
             No Yes
  Control    67  33
  Vaccinated 88  12
Code
# Extract counts
a <- vaccine_table["Vaccinated", "Yes"]    # Vaccinated & diseased
b <- vaccine_table["Vaccinated", "No"]     # Vaccinated & not diseased
c <- vaccine_table["Control", "Yes"]       # Control & diseased
d <- vaccine_table["Control", "No"]        # Control & not diseased

# Calculate risks
risk_vaccinated <- a / (a + b)
risk_control <- c / (c + d)

# Risk ratio
risk_ratio <- risk_vaccinated / risk_control
risk_ratio
[1] 0.3636364
Code
cat("Risk in vaccinated group:", round(risk_vaccinated, 3), "\n")
Risk in vaccinated group: 0.12 
Code
cat("Risk in control group:", round(risk_control, 3), "\n")
Risk in control group: 0.33 
Code
cat("Risk Ratio:", round(risk_ratio, 3), "\n")
Risk Ratio: 0.364 

Interpretation: The vaccinated group has 0.36 times the risk of disease compared to controls. Since RR < 1, vaccination is protective.

20.2.2 Vaccine Efficacy

Vaccine efficacy = 1 - RR = proportion of disease prevented by vaccination

Code
vaccine_efficacy <- (1 - risk_ratio) * 100
cat("Vaccine efficacy:", round(vaccine_efficacy, 1), "%\n")
Vaccine efficacy: 63.6 %

The vaccine prevents approximately 64% of disease cases.

20.3 Odds Ratio

The odds ratio (OR) compares the odds of disease in exposed vs unexposed:

\[ OR = \frac{a/b}{c/d} = \frac{a \times d}{b \times c} \]

Odds are different from risk (probability):
  • Risk = a / (a+b)
  • Odds = a / b

Code
# Calculate odds
odds_vaccinated <- a / b
odds_control <- c / d

# Odds ratio
odds_ratio <- odds_vaccinated / odds_control
# OR: Simplified calculation
odds_ratio_simple <- (a * d) / (b * c)

cat("Odds in vaccinated group:", round(odds_vaccinated, 3), "\n")
Odds in vaccinated group: 0.136 
Code
cat("Odds in control group:", round(odds_control, 3), "\n")
Odds in control group: 0.493 
Code
cat("Odds Ratio:", round(odds_ratio, 3), "\n")
Odds Ratio: 0.277 
Important: Risk Ratio vs Odds Ratio
  • Risk Ratio: More intuitive, easier to interpret
  • Odds Ratio: More commonly used in logistic regression and case-control studies
  • When the outcome is rare (< 10%), OR ≈ RR
  • When the outcome is common, OR and RR can differ substantially

In our vaccine example (disease rates of 12% and 33%), RR = 0.36 and OR = 0.28 differ noticeably.

20.4 Confidence Intervals

Point estimates are useful, but we also need uncertainty quantification. Let’s calculate 95% CIs:

Code
# CI for the risk ratio on the log scale (Katz method)
log_rr <- log(risk_ratio)
se_log_rr <- sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))
ci_rr <- exp(log_rr + c(-1.96, 1.96) * se_log_rr)

# CI for the odds ratio on the log scale (Woolf method)
log_or <- log(odds_ratio)
se_log_or <- sqrt(1/a + 1/b + 1/c + 1/d)
ci_or <- exp(log_or + c(-1.96, 1.96) * se_log_or)

# Display results
tibble(
  Measure = c("Risk Ratio", "Odds Ratio"),
  Estimate = c(risk_ratio, odds_ratio),
  `95% CI Lower` = c(ci_rr[1], ci_or[1]),
  `95% CI Upper` = c(ci_rr[2], ci_or[2])
) %>%
  kable(digits = 3,
        caption = "Risk Ratio and Odds Ratio with 95% Confidence Intervals")
Risk Ratio and Odds Ratio with 95% Confidence Intervals
Measure Estimate 95% CI Lower 95% CI Upper
Risk Ratio 0.364 0.200 0.662
Odds Ratio 0.277 0.133 0.576

Since both CIs exclude 1.0, we conclude there’s a statistically significant association.
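In practice, packages can automate these calculations. A sketch using the epitools package (an assumption; check which row its output treats as the reference against the manual calculations above):

Code
# epitools computes RR and OR with CIs directly from a 2x2 table
library(epitools)
riskratio(vaccine_table)
oddsratio(vaccine_table)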

21 Introduction to Logistic Regression

So far, we’ve analyzed relationships between two categorical variables (vaccine × disease, housing × health). But what if we want to:

  1. Include continuous predictors (age, weight, temperature)?
  2. Control for multiple variables simultaneously?
  3. Predict the probability of an outcome for a new individual?

This is where logistic regression comes in.

21.1 Why Not Linear Regression?

With a binary outcome (0/1, Yes/No), linear regression has problems:

Code
# Create data with binary outcome
disease_weight <- tibble(
  weight = runif(100, 200, 400),
  disease = rbinom(100, 1, prob = plogis(-5 + 0.015 * weight))
)

# Try linear regression (wrong!)
lm_wrong <- lm(disease ~ weight, data = disease_weight)

# Plot
disease_weight %>%
  ggplot(aes(x = weight, y = disease)) +
  geom_point(alpha = 0.5, size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linetype = "dashed") +
  labs(title = "Why Linear Regression Fails with Binary Outcomes",
       subtitle = "Predicted values can be < 0 or > 1 (impossible for probabilities!)",
       x = "Weight (kg)", y = "Disease (0 = No, 1 = Yes)") +
  theme_minimal()

Problems with linear regression for binary outcomes:
  • Predicted values can be < 0 or > 1
  • Residuals are not normally distributed
  • Variance is not constant (heteroscedasticity)

21.2 The Logistic Function

Logistic regression uses a special link function that constrains predictions to [0, 1]:

\[ P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}} \]

This S-shaped curve is called the logistic function or sigmoid function:

Code
tibble(x = seq(-6, 6, 0.1)) %>%
  mutate(probability = plogis(x)) %>%  # plogis() = logistic function
  ggplot(aes(x = x, y = probability)) +
  geom_line(size = 1.5, color = "steelblue") +
  geom_hline(yintercept = c(0, 1), linetype = "dashed", alpha = 0.5) +
  geom_hline(yintercept = 0.5, linetype = "dotted", alpha = 0.5) +
  labs(title = "The Logistic Function",
       subtitle = "Output is always between 0 and 1",
       x = "Linear Predictor (β₀ + β₁X)",
       y = "Probability P(Y = 1)") +
  theme_minimal()

21.3 Understanding Log-Odds (Logit)

The logistic regression equation can be rewritten as:

\[ \log\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 X \]

The left side is the log-odds (also called the logit). This makes the relationship linear in log-odds space.

Note: Three Ways to Think About Logistic Regression
  1. Probability scale: P(Y=1) is a curve between 0 and 1
  2. Odds scale: Odds = P/(1-P), ranges from 0 to ∞
  3. Log-odds scale: log(Odds), ranges from -∞ to +∞ (linear!)

Coefficients in logistic regression are in log-odds scale.
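The three scales are easy to move between in R with `plogis()` and `qlogis()`:

Code
p <- 0.75
p / (1 - p)          # odds = 3
log(p / (1 - p))     # log-odds (logit) = 1.0986
qlogis(p)            # the same logit, computed directly
plogis(qlogis(p))    # back to the probability, 0.75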

21.4 Simple Logistic Regression Example

Let’s model disease risk as a function of body weight:

Code
# Fit logistic regression
logistic_model <- glm(disease ~ weight,
                      data = disease_weight,
                      family = binomial)

# Summary
summary(logistic_model)

Call:
glm(formula = disease ~ weight, family = binomial, data = disease_weight)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept) -4.015138   1.302445  -3.083  0.00205 **
weight       0.011715   0.004264   2.747  0.00601 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 131.79  on 99  degrees of freedom
Residual deviance: 123.55  on 98  degrees of freedom
AIC: 127.55

Number of Fisher Scoring iterations: 4
Code
# Tidy output
tidy(logistic_model, conf.int = TRUE, conf.method = "Wald") %>%
  kable(digits = 4,
        caption = "Logistic Regression: Disease ~ Weight")
Logistic Regression: Disease ~ Weight
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -4.0151 1.3024 -3.0828 0.0021 -6.7092 -1.5660
weight 0.0117 0.0043 2.7474 0.0060 0.0036 0.0205

Interpreting coefficients (on the log-odds scale):
  • Intercept (β₀) = -4.02: Log-odds of disease when weight = 0 (not meaningful here)
  • Weight (β₁) = 0.0117: For each 1 kg increase in weight, the log-odds of disease increase by 0.0117

21.5 Converting to Odds Ratios

To get odds ratios, exponentiate the coefficients:

Code
# Odds ratios
tidy(logistic_model, conf.int = TRUE, exponentiate = TRUE, conf.method = "Wald") %>%
  kable(digits = 4,
        caption = "Odds Ratios from Logistic Regression")
Odds Ratios from Logistic Regression
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.0180 1.3024 -3.0828 0.0021 0.0012 0.2089
weight 1.0118 0.0043 2.7474 0.0060 1.0036 1.0207

Interpretation: For each 1 kg increase in weight, the odds of disease are multiplied by 1.012.

Since OR > 1, higher weight is associated with higher disease risk.
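Because coefficients add on the log-odds scale, the OR for a larger increment is the per-unit OR raised to that power. For a 10 kg increase in weight:

Code
b_weight <- coef(logistic_model)["weight"]   # 0.011715
exp(10 * b_weight)                           # ≈ 1.12 per 10 kg
# equivalently, the per-kg OR to the 10th power: 1.0118^10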

Important: Interpreting Odds Ratios
  • OR = 1: No association (predictor doesn’t affect odds)
  • OR > 1: Predictor increases odds of outcome
  • OR < 1: Predictor decreases odds of outcome (protective)

Example: OR = 1.02 means odds increase by 2% per unit increase in predictor.

21.6 Predicted Probabilities

We can use the model to predict disease probability for any weight:

Code
# Predict probabilities for a range of weights
pred_data <- tibble(
  weight = seq(200, 400, by = 10)
)

pred_data <- pred_data %>%
  mutate(
    predicted_prob = predict(logistic_model, newdata = pred_data,
                            type = "response")
  )

# Show some predictions
pred_data %>%
  filter(weight %in% c(200, 250, 300, 350, 400)) %>%
  kable(digits = 3,
        caption = "Predicted Disease Probabilities by Weight")
Predicted Disease Probabilities by Weight
weight predicted_prob
200 0.158
250 0.252
300 0.377
350 0.521
400 0.662
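We can verify one row of this table by hand from the fitted coefficients:

Code
# Manual check for weight = 300 kg using the inverse-logit
b <- coef(logistic_model)
plogis(b["(Intercept)"] + b["weight"] * 300)   # ≈ 0.377, as in the table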
Code
# Visualize predictions
disease_weight %>%
  ggplot(aes(x = weight, y = disease)) +
  geom_point(alpha = 0.4, size = 2) +
  geom_line(data = pred_data,
            aes(x = weight, y = predicted_prob),
            color = "blue", size = 1.5) +
  labs(title = "Logistic Regression: Disease Risk by Weight",
       subtitle = "Blue curve shows predicted probabilities",
       x = "Weight (kg)",
       y = "Probability of Disease") +
  theme_minimal()

21.7 Model Fit and Diagnostics

21.7.1 Deviance

Similar to residual sum of squares in linear regression, logistic regression uses deviance to measure model fit.

Code
glance(logistic_model) %>%
  kable(digits = 2,
        caption = "Logistic Regression Model Fit Statistics")
Logistic Regression Model Fit Statistics
null.deviance df.null logLik AIC BIC deviance df.residual nobs
131.79 99 -61.78 127.55 132.76 123.55 98 100
  • Null deviance: Model with intercept only (no predictors)
  • Residual deviance: Model with predictors
  • Lower deviance = better fit
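One deviance-based summary worth knowing (not printed by summary()) is McFadden's pseudo-R²:

Code
# McFadden's pseudo-R-squared: 1 - residual deviance / null deviance
1 - logistic_model$deviance / logistic_model$null.deviance
# = 1 - 123.55 / 131.79 ≈ 0.063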

21.7.2 Likelihood Ratio Test

Compare models with a likelihood ratio test:

Code
# Null model (intercept only)
null_model <- glm(disease ~ 1, data = disease_weight, family = binomial)

# Compare models
anova(null_model, logistic_model, test = "Chisq") %>%
  kable(digits = 4,
        caption = "Likelihood Ratio Test: Does Weight Improve the Model?")
Likelihood Ratio Test: Does Weight Improve the Model?
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
99 131.7911 NA NA NA
98 123.5509 1 8.2402 0.0041

Weight significantly improves model fit (p < 0.05).
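The likelihood ratio test is simply the drop in deviance referred to a chi-square distribution with df equal to the number of added parameters:

Code
# LRT statistic = null deviance - residual deviance
dev_drop <- 131.79 - 123.55                    # 8.24
pchisq(dev_drop, df = 1, lower.tail = FALSE)   # ≈ 0.004, matching the table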

21.8 Multiple Logistic Regression

We can include multiple predictors, just like multiple linear regression:

Code
# Add more predictors
disease_multi <- disease_weight %>%
  mutate(
    age = runif(n(), 1, 5),
    previous_disease = sample(c(0, 1), n(), replace = TRUE, prob = c(0.7, 0.3))
  )

# Fit multiple logistic regression
multi_logistic <- glm(disease ~ weight + age + previous_disease,
                      data = disease_multi,
                      family = binomial)

# Odds ratios
tidy(multi_logistic, conf.int = TRUE, exponentiate = TRUE, conf.method = "Wald") %>%
  kable(digits = 4,
        caption = "Multiple Logistic Regression: Odds Ratios")
Multiple Logistic Regression: Odds Ratios
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.0116 1.3896 -3.2073 0.0013 0.0006 0.1572
weight 1.0113 0.0043 2.6080 0.0091 1.0030 1.0202
age 1.2182 0.1844 1.0702 0.2845 0.8500 1.7606
previous_disease 0.9770 0.4772 -0.0487 0.9611 0.3760 2.4755

Interpretation (if results were significant):
  • Each 1 kg increase in weight multiplies the odds of disease by 1.011, holding age and previous disease constant
  • A history of previous disease multiplies the odds by 0.98 (a slight decrease, essentially no change), adjusting for weight and age

21.9 Practical Example: BRD Risk Factors

Let’s analyze a more realistic dataset with bovine respiratory disease (BRD):

Code
# Simulate realistic BRD data
set.seed(456)
brd_data <- tibble(
  animal_id = 1:300,
  arrival_weight = rnorm(300, mean = 250, sd = 30),
  age_months = runif(300, min = 6, max = 10),
  vaccinated = sample(c("Yes", "No"), 300, replace = TRUE),
  temperature_arrival = rnorm(300, mean = 39.2, sd = 0.8)
) %>%
  mutate(
    # Disease risk depends on multiple factors
    risk_score = -8 +
      0.01 * arrival_weight +
      0.3 * age_months -
      1.5 * (vaccinated == "Yes") +
      0.5 * temperature_arrival,
    brd = rbinom(300, 1, prob = plogis(risk_score))
  )

# Check disease rate
mean(brd_data$brd)
[1] 1
Code
# Fit comprehensive model
brd_model <- glm(brd ~ arrival_weight + age_months + vaccinated + temperature_arrival,
                 data = brd_data,
                 family = binomial)

# Get results with Wald CIs
brd_results <- tidy(brd_model, exponentiate = TRUE)
brd_ci <- confint.default(brd_model) # Wald CIs
brd_results$conf.low <- exp(brd_ci[,1])
brd_results$conf.high <- exp(brd_ci[,2])

# Display results
brd_results %>%
  select(term, estimate, conf.low, conf.high, p.value) %>%
  kable(digits = 4,
        caption = "BRD Risk Factors: Odds Ratios with 95% Wald CIs")
BRD Risk Factors: Odds Ratios with 95% Wald CIs
term estimate conf.low conf.high p.value
(Intercept) 344744308613 0 Inf 1
arrival_weight 1 0 Inf 1
age_months 1 0 Inf 1
vaccinatedYes 1 0 Inf 1
temperature_arrival 1 0 Inf 1
Code
# Extract key odds ratios
or_vaccine <- exp(coef(brd_model)["vaccinatedYes"])
or_temp <- exp(coef(brd_model)["temperature_arrival"])

cat("Key findings:\n")
Key findings:
Code
cat("- Vaccination reduces odds of BRD by",
    round((1 - or_vaccine) * 100, 1), "%\n")
- Vaccination reduces odds of BRD by 0 %
Code
cat("- Each 1°C increase in arrival temperature multiplies odds by",
    round(or_temp, 2), "\n")

21.10 Predictions for New Animals

Code
# Predict BRD risk for specific scenarios
new_animals <- tibble(
  scenario = c("Low risk", "Medium risk", "High risk"),
  arrival_weight = c(280, 250, 220),
  age_months = c(7, 8, 9),
  vaccinated = c("Yes", "No", "No"),
  temperature_arrival = c(38.5, 39.2, 40.0)
)

new_animals <- new_animals %>%
  mutate(
    predicted_risk = predict(brd_model, newdata = new_animals, type = "response")
  )

new_animals %>%
  select(scenario, vaccinated, temperature_arrival, predicted_risk) %>%
  kable(digits = 3,
        caption = "Predicted BRD Risk for Different Scenarios")
Note: When to Use Logistic Regression vs Chi-Square
  • Chi-square test: Quick test of association between two categorical variables
  • Logistic regression:
    • Include continuous predictors
    • Control for confounders
    • Make predictions
    • Model complex relationships
    • Get odds ratios with confidence intervals
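To see the overlap between the two approaches, here is a sketch (the counts are invented) showing that a chi-square test and a one-predictor logistic regression address the same 2×2 association:

```r
# Toy 2x2 question: 30/100 sick unvaccinated vs 15/100 sick vaccinated
outcome <- c(rep(1, 30), rep(0, 70), rep(1, 15), rep(0, 85))
group   <- rep(c("Unvaccinated", "Vaccinated"), each = 100)

# Chi-square test of the 2x2 table (no continuity correction)
chisq.test(table(group, outcome), correct = FALSE)$p.value

# Logistic regression with group as the only predictor
fit <- glm(outcome ~ group, family = binomial)
summary(fit)$coefficients["groupVaccinated", "Pr(>|z|)"]

# The two p-values land close together, but only the regression also
# yields an odds ratio: exp(coef(fit)["groupVaccinated"])
```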

22 Assumptions and Considerations

22.1 Chi-Square Test Assumptions

  1. Independence: Each observation must be independent

    • Violated by: Repeated measures, clustered data (multiple animals per farm)
    • Solution: Use mixed models or GEE for clustered data
  2. Expected cell counts ≥ 5: Rule of thumb for valid chi-square approximation

    • Check: Use chisq.test()$expected
    • Solution: Fisher’s exact test, combine categories, or collect more data
  3. Random sampling: Data should be representative of population

  4. Mutually exclusive categories: Each observation in exactly one category
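Assumption 2 is easy to verify in practice. A minimal sketch, with made-up counts chosen so one expected cell falls below 5:

```r
# Hypothetical 2x2 table with a sparse cell
small_table <- matrix(c(2, 8, 5, 15), nrow = 2,
                      dimnames = list(Group = c("A", "B"),
                                      Outcome = c("Yes", "No")))

# chisq.test() computes expected counts even while warning about them
expected <- suppressWarnings(chisq.test(small_table)$expected)
expected
any(expected < 5)  # TRUE here, so Fisher's exact test is the safer choice
```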

22.2 Logistic Regression Assumptions

  1. Binary outcome: DV must be 0/1 (or convertible to binary)

  2. Independence: Observations are independent

  3. Linearity of log-odds: Relationship between continuous predictors and log-odds should be linear

    • Check: Plot log-odds vs predictor, or use splines
  4. No perfect multicollinearity: Predictors shouldn’t be perfectly correlated

  5. Adequate sample size: Rule of thumb: 10-15 events per predictor variable
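The sample-size rule of thumb (point 5) is quick to check before fitting anything. A sketch with assumed study numbers:

```r
# Planned study (assumed numbers): 300 animals, ~10% expected event rate,
# and a model with 4 predictors
n_animals    <- 300
event_rate   <- 0.10
n_predictors <- 4

n_events <- n_animals * event_rate   # expected events, here 30
epv      <- n_events / n_predictors  # events per variable
epv  # 7.5: below the 10-15 guideline, so trim predictors or enlarge the study
```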

Warning: Common Pitfalls
  1. Small expected cell counts: Chi-square p-values unreliable → Use Fisher’s exact
  2. Multiple comparisons: Testing many associations increases false positives → Adjust for multiple testing
  3. Confusing OR and RR: Odds ratios ≠ Risk ratios (except for rare outcomes)
  4. Causal language: “Association” ≠ “Causation” → Observational data can’t prove causation
  5. Overfitting logistic models: Too many predictors for sample size → Use fewer predictors or collect more data
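Pitfall 3 is worth seeing numerically. A small sketch (the risks are invented) comparing the two measures at the same risk ratio:

```r
# Compare risk ratio and odds ratio for the same pair of risks
or_vs_rr <- function(p_exposed, p_control) {
  rr <- p_exposed / p_control
  or <- (p_exposed / (1 - p_exposed)) / (p_control / (1 - p_control))
  c(RR = rr, OR = or)
}

or_vs_rr(0.02, 0.01)  # rare outcome: RR = 2, OR ~ 2.02 (nearly identical)
or_vs_rr(0.60, 0.30)  # common outcome: RR = 2, OR = 3.5 (very different)
```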

22.3 Sample Size Considerations

For chi-square tests, use power analysis:

Code
# What sample size for 80% power to detect OR = 2.0?
# Assuming equal groups, alpha = 0.05

# Simulate to demonstrate concept
simulate_power <- function(n_per_group, or, n_sims = 1000) {
  p_control <- 0.20  # 20% baseline risk
  p_exposed <- p_control * or / (1 - p_control + p_control * or)

  p_values <- replicate(n_sims, {
    exposed <- sample(c("Yes", "No"), n_per_group, replace = TRUE,
                     prob = c(p_exposed, 1 - p_exposed))
    control <- sample(c("Yes", "No"), n_per_group, replace = TRUE,
                     prob = c(p_control, 1 - p_control))

    outcome <- c(exposed, control)
    group <- rep(c("Exposed", "Control"), each = n_per_group)

    test_result <- tryCatch(
      chisq.test(table(group, outcome))$p.value,
      error = function(e) NA,
      warning = function(w) NA
    )

    test_result
  })

  power <- mean(p_values < 0.05, na.rm = TRUE)
  return(power)
}

# Test different sample sizes
sample_sizes <- seq(50, 200, by = 25)
power_results <- map_dbl(sample_sizes, ~simulate_power(.x, or = 2.0, n_sims = 500))

# Plot
tibble(n_per_group = sample_sizes, power = power_results) %>%
  ggplot(aes(x = n_per_group, y = power)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 2) +
  geom_hline(yintercept = 0.80, linetype = "dashed", color = "red") +
  labs(title = "Statistical Power vs Sample Size",
       subtitle = "Detecting OR = 2.0 with 20% baseline risk",
       x = "Sample Size per Group",
       y = "Power (1 - β)") +
  theme_minimal()

23 Practical Workflow: Choosing the Right Test

Here’s a decision tree for categorical data analysis:

23.1 Decision Framework


START: Do you have categorical data?
│
├─ One categorical variable
│  └─ Chi-square goodness of fit test
│     (Do observed frequencies match expected?)
│
└─ Two or more variables
   │
   ├─ Both categorical (no continuous predictors)
   │  │
   │  ├─ 2×2 table, large sample (expected counts ≥ 5)
   │  │  └─ Chi-square test of independence
   │  │
   │  ├─ 2×2 table, small sample (expected counts < 5)
   │  │  └─ Fisher's exact test
   │  │
   │  └─ Larger table
   │     └─ Chi-square test of independence
   │
   └─ Binary outcome + continuous predictors
      │
      ├─ Single predictor, just testing association
      │  └─ Chi-square test (if categorized) or
      │     Logistic regression (for continuous)
      │
      └─ Multiple predictors or need predictions
         └─ Logistic regression
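The "both categorical" branch of this tree can be sketched as a small helper function (the function name is ours; the threshold is the expected-count rule of thumb from this chapter):

```r
# Pick between chi-square and Fisher's exact test for a contingency table
choose_two_way_test <- function(tab) {
  expected <- suppressWarnings(chisq.test(tab)$expected)
  if (any(expected < 5)) "Fisher's exact test" else "Chi-square test of independence"
}

large_table <- matrix(c(40, 60, 55, 45), nrow = 2)
small_table <- matrix(c(2, 7, 8, 3), nrow = 2)

choose_two_way_test(large_table)  # "Chi-square test of independence"
choose_two_way_test(small_table)  # "Fisher's exact test"
```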

23.2 Complete Analysis Example

Let’s walk through a complete analysis with a new dataset:

Code
# Research question: Does feed type affect mortality in broiler chickens?
# Also considering weight and age as covariates

set.seed(789)
broiler_data <- tibble(
  bird_id = 1:400,
  feed_type = sample(c("Standard", "Enhanced"), 400, replace = TRUE),
  weight_kg = rnorm(400, mean = 2.5, sd = 0.4),
  age_days = sample(35:49, 400, replace = TRUE)
) %>%
  mutate(
    # Mortality risk; the intercept is an assumed value chosen so the
    # simulated mortality rate is a few percent -- a much lower intercept
    # leaves almost no deaths and the model cannot be estimated reliably
    mortality_risk = plogis(-3 +
                           0.5 * (feed_type == "Enhanced") -
                           0.8 * weight_kg +
                           0.05 * age_days),
    mortality = rbinom(400, 1, prob = mortality_risk),
    mortality_fct = factor(mortality, levels = c(0, 1), labels = c("Alive", "Dead"))
  )

23.2.1 Step 1: Exploratory Analysis

Code
# Overall mortality rate
broiler_data %>%
  count(mortality_fct) %>%
  mutate(percent = n / sum(n) * 100) %>%
  kable(digits = 1, caption = "Overall Mortality")
Code
# Mortality by feed type
broiler_data %>%
  count(feed_type, mortality_fct) %>%
  group_by(feed_type) %>%
  mutate(percent = n / sum(n) * 100) %>%
  kable(digits = 1, caption = "Mortality by Feed Type")
Code
# Visualization
broiler_data %>%
  ggplot(aes(x = feed_type, fill = mortality_fct)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("Alive" = "steelblue", "Dead" = "coral")) +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "Mortality Rate by Feed Type",
       x = "Feed Type", y = "Proportion", fill = "Status") +
  theme_minimal()

23.2.2 Step 2: Simple Association Test

Code
# Chi-square test
mortality_table <- table(broiler_data$feed_type, broiler_data$mortality_fct)
chi_mortality <- chisq.test(mortality_table)
tidy(chi_mortality) %>%
  kable(digits = 4, caption = "Chi-Square Test: Feed Type vs Mortality")

23.2.3 Step 3: Calculate Effect Size (Odds Ratio)

Code
# Calculate odds ratio
a <- mortality_table["Enhanced", "Dead"]
b <- mortality_table["Enhanced", "Alive"]
c <- mortality_table["Standard", "Dead"]
d <- mortality_table["Standard", "Alive"]

or_simple <- (a * d) / (b * c)
cat("Odds Ratio (Enhanced vs Standard):", round(or_simple, 3), "\n")

23.2.4 Step 4: Logistic Regression (Adjusting for Confounders)

Code
# Fit logistic regression
mortality_model <- glm(mortality ~ feed_type + weight_kg + age_days,
                       data = broiler_data,
                       family = binomial)

# Results
tidy(mortality_model, conf.int = TRUE, exponentiate = TRUE) %>% # profile-likelihood CIs via confint()
  kable(digits = 4, caption = "Adjusted Odds Ratios from Logistic Regression")

23.2.5 Step 5: Interpret and Report

Code
# Extract key results
or_adjusted <- exp(coef(mortality_model)["feed_typeStandard"])
ci_adjusted <- exp(confint(mortality_model)["feed_typeStandard", ])

cat("Findings:\n")
Findings:
Code
cat("1. Unadjusted OR (chi-square):", round(or_simple, 2), "\n")
Code
cat("2. Adjusted OR (logistic regression):", round(or_adjusted, 2), "\n")
Code
cat("   95% CI: [", round(ci_adjusted[1], 2), ",", round(ci_adjusted[2], 2), "]\n")
Code
cat("\nInterpretation:\n")

Interpretation:
Code
cat("After adjusting for weight and age, feed type is",
    ifelse(summary(mortality_model)$coefficients["feed_typeStandard", 4] < 0.05,
           "significantly", "not significantly"),
    "associated with mortality.\n")
After adjusting for weight and age, feed type is not significantly associated with mortality.

24 Practice Problems

Work through these problems to reinforce your understanding.

24.1 Problem 1: Mastitis and Housing

A dairy farm compares mastitis rates across three housing systems:

Code
# Data
mastitis_data <- tibble(
  housing = rep(c("Freestall", "Tiestall", "Pasture"), each = 80),
  mastitis = c(
    sample(c("Yes", "No"), 80, replace = TRUE, prob = c(0.25, 0.75)),  # Freestall
    sample(c("Yes", "No"), 80, replace = TRUE, prob = c(0.35, 0.65)),  # Tiestall
    sample(c("Yes", "No"), 80, replace = TRUE, prob = c(0.15, 0.85))   # Pasture
  )
)

# Your tasks:
# 1. Create a contingency table
# 2. Perform chi-square test
# 3. Calculate Cramér's V
# 4. Visualize the results
# 5. Interpret findings

Solution:

Code
# 1. Contingency table
mastitis_table <- table(mastitis_data$housing, mastitis_data$mastitis)
addmargins(mastitis_table)
           
             No Yes Sum
  Freestall  58  22  80
  Pasture    73   7  80
  Tiestall   53  27  80
  Sum       184  56 240
Code
# 2. Chi-square test
chi_mastitis <- chisq.test(mastitis_table)
tidy(chi_mastitis) %>%
  kable(digits = 4, caption = "Chi-Square Test Results")
Chi-Square Test Results
statistic p.value parameter method
15.1398 5e-04 2 Pearson’s Chi-squared test
Code
# 3. Cramér's V
n <- sum(mastitis_table)
k <- min(dim(mastitis_table))
cramers_v_mastitis <- sqrt(chi_mastitis$statistic / (n * (k - 1)))
cat("Cramér's V:", round(cramers_v_mastitis, 3), "\n")
Cramér's V: 0.251 
Code
# 4. Visualization
mastitis_data %>%
  count(housing, mastitis) %>%
  group_by(housing) %>%
  mutate(prop = n / sum(n)) %>%
  filter(mastitis == "Yes") %>%
  ggplot(aes(x = housing, y = prop, fill = housing)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = paste0(round(prop * 100, 1), "%")),
            vjust = -0.5) +
  scale_y_continuous(labels = percent_format(), limits = c(0, 0.4)) +
  labs(title = "Mastitis Rate by Housing System",
       x = "Housing System", y = "Mastitis Rate") +
  theme_minimal()

Code
# 5. Interpretation
cat("\nInterpretation:\n")

Interpretation:
Code
cat("There is a statistically significant association between housing system")
There is a statistically significant association between housing system
Code
cat(" and mastitis rate (p =", round(chi_mastitis$p.value, 4), ").\n")
 and mastitis rate (p = 5e-04 ).
Code
cat("Pasture-based systems show the lowest mastitis rate, while tiestall")
Pasture-based systems show the lowest mastitis rate, while tiestall
Code
cat(" systems show the highest.\n")
 systems show the highest.

24.2 Problem 2: Small Sample Fisher’s Test

Investigate a rare genetic disorder in a small sample:

Code
# Data: Does dam age affect disorder occurrence?
genetic_data <- matrix(c(5, 2,   # Young dams
                        1, 7),   # Older dams
                      nrow = 2, byrow = TRUE,
                      dimnames = list(Dam_Age = c("Young", "Older"),
                                     Disorder = c("Yes", "No")))

# Your tasks:
# 1. Check expected cell counts
# 2. Perform Fisher's exact test
# 3. Calculate odds ratio
# 4. Interpret results

Solution:

Code
# 1. Check expected counts
chi_test <- chisq.test(genetic_data)
cat("Expected counts:\n")
Expected counts:
Code
print(chi_test$expected)
       Disorder
Dam_Age Yes  No
  Young 2.8 4.2
  Older 3.2 4.8
Code
cat("\nNote: Expected counts < 5, so Fisher's exact test is appropriate.\n\n")

Note: Expected counts < 5, so Fisher's exact test is appropriate.
Code
# 2. Fisher's exact test
fisher_genetic <- fisher.test(genetic_data)
cat("Fisher's Exact Test Results:\n")
Fisher's Exact Test Results:
Code
cat("Odds Ratio:", round(fisher_genetic$estimate, 3), "\n")
Odds Ratio: 13.594 
Code
cat("95% CI: [", round(fisher_genetic$conf.int[1], 3), ",",
    round(fisher_genetic$conf.int[2], 3), "]\n")
95% CI: [ 0.865 , 934.009 ]
Code
cat("P-value:", round(fisher_genetic$p.value, 4), "\n\n")
P-value: 0.0406 
Code
# 3. Manual OR calculation
or_manual <- (genetic_data[1,1] * genetic_data[2,2]) /
             (genetic_data[1,2] * genetic_data[2,1])
cat("Manual OR calculation:", round(or_manual, 3), "\n\n")
Manual OR calculation: 17.5 
Code
# 4. Interpretation
cat("Interpretation:\n")
Interpretation:
Code
cat("Young dams have", round(fisher_genetic$estimate, 1),
    "times higher odds of having offspring with the disorder")
Young dams have 13.6 times higher odds of having offspring with the disorder
Code
cat(" compared to older dams.\n")
 compared to older dams.
Code
cat("However, with p =", round(fisher_genetic$p.value, 3),
    "and a small sample size,\n")
However, with p = 0.041 and a small sample size,
Code
cat("we should interpret these results cautiously and consider collecting more data.\n")
we should interpret these results cautiously and consider collecting more data.

24.3 Problem 3: Logistic Regression with Multiple Predictors

Analyze factors affecting retained placenta in dairy cows:

Code
# Simulate data
set.seed(2025)
rp_data <- tibble(
  cow_id = 1:250,
  parity = sample(1:5, 250, replace = TRUE),
  bcs_precalving = rnorm(250, mean = 3.5, sd = 0.5),
  twin_birth = sample(c(0, 1), 250, replace = TRUE, prob = c(0.95, 0.05)),
  dystocia = sample(c(0, 1), 250, replace = TRUE, prob = c(0.85, 0.15))
) %>%
  mutate(
    # Intercept is an assumed value chosen so retained placenta occurs
    # often enough for the model to be estimable
    rp_risk = plogis(-1 +
                    0.2 * (parity >= 3) -
                    0.5 * bcs_precalving +
                    2.0 * twin_birth +
                    1.5 * dystocia),
    retained_placenta = rbinom(250, 1, prob = rp_risk)
  )

# Your tasks:
# 1. Fit a logistic regression model
# 2. Interpret odds ratios for each predictor
# 3. Identify the strongest risk factor
# 4. Predict RP risk for a specific cow profile

Solution:

Code
# 1. Fit model
rp_model <- glm(retained_placenta ~ parity + bcs_precalving + twin_birth + dystocia,
                data = rp_data,
                family = binomial)

# 2. Odds ratios with CIs
rp_results <- tidy(rp_model, conf.int = TRUE, exponentiate = TRUE) # profile-likelihood CIs via confint()
rp_results %>%
  kable(digits = 3, caption = "Retained Placenta Risk Factors: Odds Ratios")
Code
# 3. Identify strongest risk factor
cat("\nStrongest Risk Factors (by OR magnitude):\n")

Strongest Risk Factors (by OR magnitude):
Code
rp_results %>%
  filter(term != "(Intercept)") %>%
  # rank by |log(OR)| so protective (OR < 1) and harmful (OR > 1) effects
  # are treated symmetrically
  arrange(desc(abs(log(estimate)))) %>%
  select(term, estimate, p.value) %>%
  print()
Code
# 4. Predict for specific profiles
new_cows <- tibble(
  scenario = c("Low risk", "High risk"),
  parity = c(2, 4),
  bcs_precalving = c(3.5, 2.8),
  twin_birth = c(0, 1),
  dystocia = c(0, 1)
)

new_cows %>%
  mutate(predicted_rp_risk = predict(rp_model, newdata = new_cows,
                                     type = "response")) %>%
  kable(digits = 3, caption = "Predicted Retained Placenta Risk")
Code
cat("\nInterpretation:\n")

Interpretation:
Code
cat("- Twin births have the strongest association with retained placenta\n")
- Twin births have the strongest association with retained placenta
Code
cat("- Higher BCS appears protective (OR < 1)\n")
- Higher BCS appears protective (OR < 1)
Code
cat("- Dystocia also substantially increases risk\n")
- Dystocia also substantially increases risk
Code
cat("- A high-risk cow profile (parity 4, low BCS, twins, dystocia) has\n")
- A high-risk cow profile (parity 4, low BCS, twins, dystocia) has
Code
cat("  substantially higher predicted risk than a low-risk cow.\n")
  substantially higher predicted risk than a low-risk cow.

24.4 Problem 4: Goodness of Fit - Genetic Ratios

Test if offspring coat colors match expected Mendelian ratios:

Code
# Observed offspring: Black, Brown, White
observed_colors <- c(Black = 42, Brown = 38, White = 20)

# Expected ratio: 2:2:1
# Your tasks:
# 1. Perform chi-square goodness of fit test
# 2. Visualize observed vs expected
# 3. Interpret whether data fits the genetic model

Solution:

Code
# 1. Chi-square goodness of fit
expected_proportions <- c(2, 2, 1) / 5
chi_colors <- chisq.test(observed_colors, p = expected_proportions)

tidy(chi_colors) %>%
  kable(digits = 4, caption = "Chi-Square Goodness of Fit Test")
Chi-Square Goodness of Fit Test
statistic p.value parameter method
0.2 0.9048 2 Chi-squared test for given probabilities
Code
# 2. Visualize
color_comparison <- tibble(
  Color = names(observed_colors),
  Observed = observed_colors,
  Expected = expected_proportions * sum(observed_colors)
) %>%
  pivot_longer(cols = c(Observed, Expected),
               names_to = "Type", values_to = "Count")

color_comparison %>%
  ggplot(aes(x = Color, y = Count, fill = Type)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = round(Count, 1)),
            position = position_dodge(width = 0.9), vjust = -0.5) +
  scale_fill_manual(values = c("Observed" = "steelblue",
                                "Expected" = "coral")) +
  labs(title = "Coat Color Distribution: Observed vs Expected",
       subtitle = paste0("χ² = ", round(chi_colors$statistic, 2),
                        ", p = ", round(chi_colors$p.value, 3)),
       x = "Coat Color", y = "Count") +
  theme_minimal()

Code
# 3. Interpretation
cat("\nInterpretation:\n")

Interpretation:
Code
cat("P-value =", round(chi_colors$p.value, 4), "\n")
P-value = 0.9048 
Code
if(chi_colors$p.value > 0.05) {
  cat("The observed data is consistent with the expected 2:2:1 ratio.\n")
  cat("We do not have evidence to reject the proposed genetic model.\n")
} else {
  cat("The observed data significantly differs from the expected 2:2:1 ratio.\n")
  cat("The genetic model may not fit, or other factors may be at play.\n")
}
The observed data is consistent with the expected 2:2:1 ratio.
We do not have evidence to reject the proposed genetic model.

25 Summary and Key Takeaways

This week, we’ve covered a comprehensive toolkit for analyzing categorical data:

25.1 Key Concepts

  1. Chi-Square Goodness of Fit: Tests if observed frequencies match expected distribution
    • One categorical variable
    • Used for genetic ratios, expected proportions
  2. Chi-Square Test of Independence: Tests association between two categorical variables
    • Works with 2×2 or larger tables
    • Requires expected cell counts ≥ 5
  3. Fisher’s Exact Test: Alternative to chi-square for small samples
    • Use when expected counts < 5
    • Provides exact p-value
  4. Risk Ratios and Odds Ratios: Quantify strength of association
    • Risk Ratio: More intuitive, ratio of probabilities
    • Odds Ratio: Used in logistic regression, ratio of odds
    • OR ≈ RR when outcome is rare
  5. Logistic Regression: Model binary outcomes with continuous/multiple predictors
    • Uses logit link function
    • Coefficients are log-odds
    • Exponentiate to get odds ratios
    • Can include multiple predictors and control for confounders
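The scale-hopping in concept 5 is all base R. A minimal sketch (the coefficient value is invented):

```r
# The logit (qlogis) and its inverse (plogis) connect the three scales
p <- 0.25
log_odds <- qlogis(p)   # log(0.25 / 0.75), about -1.099
plogis(log_odds)        # back to the probability 0.25

# Exponentiating a logistic regression coefficient gives an odds ratio
beta <- 0.7             # hypothetical coefficient on the log-odds scale
exp(beta)               # about 2.01: odds roughly double per unit increase
```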

25.2 Decision Framework Summary

Choosing the Right Test for Categorical Data
Situation Recommended Test
One categorical variable vs expected distribution Chi-square goodness of fit
Two categorical variables, large sample Chi-square test of independence
Two categorical variables, small sample Fisher’s exact test
Binary outcome + continuous predictors Logistic regression
Binary outcome + multiple predictors Logistic regression
Need predictions for new observations Logistic regression

25.3 R Functions Summary

Key R Functions for Categorical Data Analysis
Function Purpose
table() Create contingency tables
prop.table() Convert counts to proportions
chisq.test() Chi-square tests (goodness of fit and independence)
fisher.test() Fisher’s exact test for small samples
glm(..., family = binomial) Fit logistic regression model
predict(..., type = 'response') Get predicted probabilities from logistic regression
exp(coef()) Convert log-odds to odds ratios
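The first few functions in the table chain together naturally. A toy run-through (the counts are ours):

```r
# 100 control and 100 treated animals with made-up health status
status <- c(rep("Sick", 30), rep("Healthy", 70),
            rep("Sick", 15), rep("Healthy", 85))
group  <- rep(c("Control", "Treated"), each = 100)

tab <- table(group, status)   # contingency table
prop.table(tab, margin = 1)   # row proportions: 30% vs 15% sick
chisq.test(tab)$p.value       # test of independence
fisher.test(tab)$estimate     # odds ratio (conditional MLE)
```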

25.4 Common Pitfalls to Avoid

Warning: Watch Out For These Mistakes
  1. Using chi-square with small expected counts → Use Fisher’s exact test
  2. Confusing odds ratios with risk ratios → They’re different! (except for rare outcomes)
  3. Interpreting association as causation → Observational data requires caution
  4. Ignoring effect sizes → Statistical significance ≠ practical importance
  5. Overfitting logistic models → Too many predictors for sample size
  6. Forgetting independence assumption → Clustered/repeated measures violate this
  7. Multiple testing without correction → Increases false positive rate

25.5 Practical Significance

Always consider:

  • Effect size (Cramér’s V, odds ratios), not just p-values
  • Confidence intervals for uncertainty
  • Biological/practical meaning of findings
  • Study design limitations (observational vs experimental)

26 Additional Resources

26.2 Online Resources

  • StatQuest Videos: “Chi-Square Test” and “Logistic Regression” by Josh Starmer
  • r-bloggers: Tutorials on categorical data analysis in R
  • Cross Validated (StackExchange): Q&A on statistical concepts

26.3 R Packages for Categorical Data

Code
# Beyond base R functions
install.packages(c(
  "vcd",        # Visualizing categorical data
  "epitools",   # Epidemiology tools (OR, RR calculations)
  "rms",        # Regression modeling strategies
  "MASS"        # Ordinal logistic regression (polr)
))

26.4 Next Week Preview: Simple Linear Regression

Next week, we return to continuous outcomes and explore linear regression:

  • Correlation vs regression
  • Fitting a regression line
  • Interpreting slope and intercept
  • Residual diagnostics
  • Prediction and confidence intervals

Connection to This Week: Logistic regression and linear regression are both types of Generalized Linear Models (GLMs). The main difference is:

  • Linear regression: Continuous outcome, identity link
  • Logistic regression: Binary outcome, logit link

The concepts of model fitting, interpretation, and diagnostics carry over!
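Both family arguments in action, on simulated toy data (all values are ours):

```r
# One glm() call fits either model; only the family (and link) changes
set.seed(1)
x      <- rnorm(100)
y_cont <- 2 + 3 * x + rnorm(100)            # continuous outcome
y_bin  <- rbinom(100, 1, plogis(-0.5 + x))  # binary outcome

fit_linear   <- glm(y_cont ~ x, family = gaussian)  # identity link
fit_logistic <- glm(y_bin ~ x, family = binomial)   # logit link

coef(fit_linear)[["x"]]    # slope on the outcome scale (near the true 3)
coef(fit_logistic)[["x"]]  # slope on the log-odds scale
```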

27 Homework Assignment

27.1 Week 6 Homework: Categorical Data Analysis

27.1.1 Instructions

Complete the following analyses using R and Quarto. Submit both your .qmd and rendered .html files.

27.1.2 Part 1: Chi-Square Analysis (30 points)

You’re investigating the relationship between calving ease and calf sex in beef cattle. Use the following data:

Code
# Run this code to create your dataset
set.seed(YOUR_STUDENT_ID)  # Use your actual student ID number
calving_data <- tibble(
  calf_sex = sample(c("Male", "Female"), 200, replace = TRUE),
  calving_ease = sample(c("Easy", "Moderate", "Difficult"), 200,
                        replace = TRUE,
                        prob = c(0.60, 0.30, 0.10))
)

Tasks:

  1. Create a contingency table with row and column totals
  2. Calculate proportions by calf sex
  3. Perform a chi-square test of independence
  4. Calculate Cramér’s V to quantify effect size
  5. Create a visualization showing calving ease by calf sex
  6. Write a 3-4 sentence interpretation of your findings

27.1.3 Part 2: Odds Ratios and Risk Ratios (25 points)

A veterinary study investigates whether pre-weaning vaccination reduces diarrhea in piglets:

  • Vaccinated: 8 with diarrhea, 92 without
  • Unvaccinated: 20 with diarrhea, 80 without

Tasks:

  1. Create a 2×2 table
  2. Calculate the risk of diarrhea in each group
  3. Calculate the risk ratio
  4. Calculate the odds ratio
  5. Perform Fisher’s exact test
  6. Interpret all results, including vaccine efficacy

27.1.4 Part 3: Logistic Regression Analysis (35 points)

Use the following simulated data on lameness in dairy cows:

Code
set.seed(YOUR_STUDENT_ID + 100)
lameness_data <- tibble(
  cow_id = 1:300,
  age_years = runif(300, 2, 8),
  bcs = rnorm(300, mean = 3.0, sd = 0.5),
  hoof_trimming = sample(c("Regular", "Irregular"), 300, replace = TRUE),
  floor_type = sample(c("Concrete", "Rubber"), 300, replace = TRUE)
) %>%
  mutate(
    lameness_prob = plogis(-2 + 0.3 * age_years - 0.5 * bcs +
                           1.2 * (hoof_trimming == "Irregular") -
                           0.8 * (floor_type == "Rubber")),
    lameness = rbinom(300, 1, prob = lameness_prob)
  )

Tasks:

  1. Fit a logistic regression model with lameness as the outcome and all other variables (except cow_id and lameness_prob) as predictors
  2. Create a table of odds ratios with 95% confidence intervals
  3. Identify which variables are statistically significant
  4. Interpret the odds ratio for hoof_trimming in context
  5. Create predictions for these three cow profiles:
     • Young cow (3 years), BCS 3.5, Regular trimming, Rubber floor
     • Old cow (7 years), BCS 2.5, Irregular trimming, Concrete floor
     • Average cow (5 years), BCS 3.0, Regular trimming, Concrete floor
  6. Visualize predicted probability of lameness by age, with separate lines for regular vs irregular hoof trimming

27.1.5 Part 4: Reflection (10 points)

Write a 250-300 word reflection addressing:

  1. When would you use chi-square vs Fisher’s exact test?
  2. What’s the key advantage of logistic regression over simple chi-square tests?
  3. Describe one challenge you encountered in this week’s material and how you addressed it
  4. How might you apply these methods in your own research or field of interest?

27.1.6 Submission Checklist

27.1.7 Grading Rubric

  • Code correctness (40%): Code runs, uses appropriate functions, no major errors
  • Interpretation (40%): Correct interpretation of results, addresses all questions
  • Presentation (15%): Clear visualizations, well-formatted tables, organized document
  • Reflection (5%): Thoughtful responses, demonstrates understanding

Due Date: [To be specified by instructor]

Estimated Time: 3-4 hours

Good luck! Remember to use the course resources and ask questions if you get stuck.